In session 7 (week 8) we discussed data and society: academic and practice-oriented discourse on the social, political and ethical aspects of data science. We discussed how one can responsibly carry out data science research on social phenomena, which ethical and social frameworks can help us critically approach data science practices and their effects on society, and what ethical practices for data scientists look like.
- Hate crimes (a CSV file): https://fivethirtyeight.com/features/higher-rates-of-hate-crimes-are-tied-to-income-inequality/
- OECD poverty gap (a CSV file): https://data.oecd.org/inequality/poverty-rate.htm
- OECD income inequality: https://data.oecd.org/inequality/income-inequality.htm#indicator-chart
- World Bank Poverty & Equity Data Portal: https://povertydata.worldbank.org/poverty/home/
- NHS (multiple files), the NHS inequality challenge: https://www.nuffieldtrust.org.uk/project/nhs-visual-data-challenge
- ONS:
  1. Gender pay gap: https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/earningsandworkinghours/datasets/annualsurveyofhoursandearningsashegenderpaygaptables
  2. Health state life expectancies by Index of Multiple Deprivation (IMD 2015 and IMD 2019), England, all ages (multiple publications): https://www.ons.gov.uk/peoplepopulationandcommunity/healthandsocialcare/healthinequalities/datasets/healthstatelifeexpectanciesbyindexofmultipledeprivationimdenglandallages
Indicators - critical reviews: The Poverty of Statistics and the Statistics of Poverty, https://www.tandfonline.com/doi/full/10.1080/01436590903321844?src=recsys
Indicators in global health arguments: indicators are usually comprehensible only to a small group of experts. Why use indicators then? "Because indicators used in global HIV finance offer openings for engagement to promote accountability (...) some indicators and data truly are better than others, and as they were all created by humans, they all can be deconstructed and remade in other forms" — Davis, S. (2020). The Uncounted: Politics of Data in Global Health. Cambridge University Press. doi:10.1017/9781108649544
Indicators - conceptualization
https://github.com/fivethirtyeight/data/tree/master/hate-crimes
| Header | Definition |
|---|---|
| state | State name |
| median_household_income | Median household income, 2016 |
| share_unemployed_seasonal | Share of the population that is unemployed (seasonally adjusted), Sept. 2016 |
| share_population_in_metro_areas | Share of the population that lives in metropolitan areas, 2015 |
| share_population_with_high_school_degree | Share of adults 25 and older with a high-school degree, 2009 |
| share_non_citizen | Share of the population that are not U.S. citizens, 2015 |
| share_white_poverty | Share of white residents who are living in poverty, 2015 |
| gini_index | Gini Index, 2015 |
| share_non_white | Share of the population that is not white, 2015 |
| share_voters_voted_trump | Share of 2016 U.S. presidential voters who voted for Donald Trump |
| hate_crimes_per_100k_splc | Hate crimes per 100,000 population, Southern Poverty Law Center, Nov. 9-18, 2016 |
| avg_hatecrimes_per_100k_fbi | Average annual hate crimes per 100,000 population, FBI, 2010-2015 |
import pandas as pd
df = pd.read_excel('hate_Crimes_v2.xlsx')  # local Excel copy of the FiveThirtyEight hate-crimes data
A reminder: anything with a `pd.` prefix comes from pandas. Keeping the prefix is particularly useful for preventing a module from overwriting built-in Python functionality.
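A toy illustration of why the prefix matters, using numpy (a star import like `from numpy import *` would shadow several built-ins, including `sum`):

```python
import numpy as np

# Python's built-in sum and numpy's sum behave differently on some inputs,
# so it matters which one a bare `sum` refers to:
print(sum([[1, 2], [3, 4]], []))   # built-in sum concatenates the lists -> [1, 2, 3, 4]
print(np.sum([[1, 2], [3, 4]]))    # numpy sum adds every element -> 10
```

Keeping `np.` and `pd.` prefixes makes it unambiguous which function is being called.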
Let's have a look at our dataset
df.tail()
type(df)
df.info()
The table above shows that we have some missing data for some of the states. See below too.
df.isna().sum()
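Once we know which columns have gaps, we can either fill them or drop the incomplete rows. A minimal sketch on a toy frame (made-up values standing in for the real file, which is not loaded here):

```python
import pandas as pd
import numpy as np

# Toy stand-in for the hate-crimes data, with one missing share value.
toy = pd.DataFrame({
    'NAME': ['A', 'B', 'C'],
    'share_non_citizen': [0.05, np.nan, 0.02],
})
print(toy.isna().sum())  # count missing values per column

# Option 1: fill the gap with the column mean.
filled = toy.fillna({'share_non_citizen': toy['share_non_citizen'].mean()})
# Option 2: simply drop incomplete rows.
dropped = toy.dropna()
print(len(dropped))      # only the complete rows survive
```

Which option is appropriate depends on the analysis; filling with a mean keeps every state on the map but flattens real variation.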
import numpy as np
np.unique(df.NAME)
There aren't any unexpected values in 'NAME'.
#using James' code from the last lab: we need the geospatial polygons of the states in America
import geopandas as gpd
import pandas as pd
import altair as alt
geo_states = gpd.read_file('gz_2010_us_040_00_500k.json')
geo_states.head()
alt.Chart(geo_states, title='US states').mark_geoshape().encode(
).properties(
width=500,
height=300
).project(
type='albersUsa'
)
# Add the data
#the state column is already called 'NAME' in both tables, so we can merge on it directly
geo_states = geo_states.merge(df, on='NAME')
geo_states.head()
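One caveat: `merge` defaults to an inner join, so any state spelled differently in the two tables (or present in only one of them) disappears silently. A small sketch on toy frames, using pandas' `indicator` parameter to surface the mismatches:

```python
import pandas as pd

states = pd.DataFrame({'NAME': ['Alabama', 'Alaska', 'Puerto Rico']})
stats  = pd.DataFrame({'NAME': ['Alabama', 'Alaska', 'District of Columbia'],
                       'rate': [1.8, 1.7, 11.0]})

# The default inner merge keeps only names present in both frames.
inner = states.merge(stats, on='NAME')
print(len(inner))  # 2 matching rows

# An outer merge with indicator=True shows exactly what failed to match.
check = states.merge(stats, on='NAME', how='outer', indicator=True)
print(check[check['_merge'] != 'both'])
```

Running a check like this before plotting avoids maps with unexplained blank states.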
alt.Chart(geo_states, title='PRE-election Hate crime per 100k').mark_geoshape().encode(
color='avg_hatecrimes_per_100k_fbi',
tooltip=['NAME', 'avg_hatecrimes_per_100k_fbi']
).properties(
width=500,
height=300
).project(
type='albersUsa'
)
alt.Chart(geo_states, title='POST-election Hate crime per 100k').mark_geoshape().encode(
color='hate_crimes_per_100k_splc',
tooltip=['NAME', 'hate_crimes_per_100k_splc']
).properties(
width=500,
height=300
).project(
type='albersUsa'
)
import seaborn as sns
sns.pairplot(data = df.iloc[:,1:])
df.boxplot(column=['median_household_income'])
df.boxplot(column=['avg_hatecrimes_per_100k_fbi'])
We may want to drop rows or columns (remove them). Let us drop Hawaii.
df[df.NAME == 'Hawaii']
df = df.drop(df.index[11])
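Dropping by position (`df.index[11]`) works here, but it silently breaks if the row order ever changes. A more robust alternative, sketched on a toy frame, is to filter on the value itself:

```python
import pandas as pd

toy = pd.DataFrame({'NAME': ['Georgia', 'Hawaii', 'Idaho'],
                    'rate': [1.7, 2.1, 1.9]})

# A boolean mask on the state name doesn't depend on where Hawaii sits.
toy = toy[toy.NAME != 'Hawaii']
print(toy.NAME.tolist())  # Hawaii is gone regardless of row order
```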
df.describe()
df.plot(x = 'avg_hatecrimes_per_100k_fbi', y = 'median_household_income', kind='scatter')
df.plot(x = 'hate_crimes_per_100k_splc', y = 'median_household_income', kind='scatter')
df[df.hate_crimes_per_100k_splc > (np.std(df.hate_crimes_per_100k_splc) * 2.5)]
import matplotlib.pyplot as plt
outliers_df = df[df.hate_crimes_per_100k_splc > (np.std(df.hate_crimes_per_100k_splc) * 2.5)]
df.plot(x = 'hate_crimes_per_100k_splc', y = 'median_household_income', kind='scatter')
plt.scatter(outliers_df.hate_crimes_per_100k_splc, outliers_df.median_household_income ,c='red')
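Note that the threshold above compares raw values against 2.5 standard deviations without first subtracting the mean. A conventional z-score centres the data before scaling; a sketch on a toy series (made-up rates with one extreme value, and a threshold chosen to suit the toy data):

```python
import numpy as np
import pandas as pd

vals = pd.Series([0.1, 0.2, 0.15, 0.3, 2.5])

# z-score: centre on the mean, then scale by the (population) std.
z = (vals - vals.mean()) / vals.std(ddof=0)
outliers = vals[np.abs(z) > 1.5]
print(outliers)  # only the extreme value is flagged
```

For skewed variables like the SPLC rate, both versions flag only the high end; centring mainly matters when the mean is far from zero relative to the spread.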
df_pivot = df.pivot_table(index=['NAME'], values=['hate_crimes_per_100k_splc', 'avg_hatecrimes_per_100k_fbi', 'median_household_income'])
df_pivot
##sort by values
#df_pivot = pd.pivot_table(df, index=['state'], columns = ['hate_crimes_per_100k_splc'], fill_value=0)
#df_pivot
#df2 = df_pivot.reindex(df_pivot['hate_crimes_per_100k_splc'].sort_values(by='hate_crimes_per_100k_splc', ascending=False).index)
df_pivot.sort_values(by=['avg_hatecrimes_per_100k_fbi'], ascending=False)
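If we only want the top few states, `nlargest` is a shorthand for sorting and taking the head. A sketch with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({'avg_hatecrimes_per_100k_fbi': [1.8, 4.8, 11.0]},
                   index=['Alabama', 'Oregon', 'District of Columbia'])

# Equivalent to sort_values(ascending=False).head(2), but more direct.
print(toy['avg_hatecrimes_per_100k_fbi'].nlargest(2))
```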
#This is code for standardization
from sklearn import preprocessing
import numpy as np
#Get column names first
#names = df.columns
#df_stand = df[['median_household_income','share_unemployed_seasonal']]
df_stand = df[['median_household_income','share_unemployed_seasonal', 'share_population_in_metro_areas'
, 'share_population_with_high_school_degree', 'share_non_citizen', 'share_white_poverty', 'gini_index'
, 'share_non_white', 'share_voters_voted_trump', 'hate_crimes_per_100k_splc', 'avg_hatecrimes_per_100k_fbi']]
names = df_stand.columns
#Create the Scaler object
scaler = preprocessing.StandardScaler()
#Fit your data on the scaler object
df2 = scaler.fit_transform(df_stand)
df2 = pd.DataFrame(df2, columns=names)
df2.tail()
ax = sns.boxplot(data=df2, orient="h")
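A quick sanity check on what `StandardScaler` does: each column ends up with mean ≈ 0 and standard deviation ≈ 1 (it uses the population std, i.e. `ddof=0`). A sketch on a toy array:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])
scaled = preprocessing.StandardScaler().fit_transform(X)

# Every column is centred and rescaled, whatever its original units.
print(scaled.mean(axis=0))  # ~[0, 0]
print(scaled.std(axis=0))   # ~[1, 1]
```

This is why the boxplots of the standardized columns all sit on a comparable scale.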
#we wanted to remove the row with Hawaii (row nr 11) following https://chrisalbon.com/python/data_wrangling/pandas_dropping_column_and_rows/
df2 = df.copy()
df2
#df2.drop('Hawaii') #fails: drop() expects index labels, not column values
#df2.drop(11) #drops by index label; fragile once rows have been removed
df2 = df2[df2.NAME != 'Hawaii'] #filter by value; note that drop() returns a copy unless the result is assigned
df2.tail()
import scipy.stats
#instead of running it one by one for every pair of variables, like:
#scipy.stats.pearsonr(st_wine.quality.values, st_wine.alcohol.values)
corrMatrix = df2.corr().round(2)
print (corrMatrix)
import seaborn as sns
import matplotlib.pyplot as plt
corrMatrix = df2.corr().round(1) #".round(1)" makes the heatmap easier to read given the number of variables
sns.heatmap(corrMatrix, annot=True)
plt.show()
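With this many variables, it can be easier to read the strongest correlations off programmatically than from the heatmap. A sketch on synthetic data (random columns, with `b` constructed to track `a`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)
toy = pd.DataFrame({'a': a,
                    'b': a * 2 + rng.normal(scale=0.1, size=100),
                    'c': rng.normal(size=100)})

# Flatten the correlation matrix, keep each pair once, sort by strength.
pairs = toy.corr().unstack()
pairs = pairs[pairs.index.get_level_values(0) < pairs.index.get_level_values(1)]
print(pairs.abs().sort_values(ascending=False))  # (a, b) comes out on top
```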
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn import metrics
x = df2[['median_household_income', 'share_population_with_high_school_degree', 'share_voters_voted_trump']]
y = df2[['avg_hatecrimes_per_100k_fbi']]
#what if we change the y variable
#y = df2[['hate_crimes_per_100k_splc']]
model = LinearRegression(fit_intercept = True)
model.fit(x, y)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
y_hat = model.predict(x)
print("MSE:", metrics.mean_squared_error(y, y_hat))
print("R^2:", metrics.r2_score(y, y_hat))
print ("var:", y.var())
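The reason for printing the variance alongside MSE: for in-sample predictions from a model with an intercept, R² = 1 − MSE / Var(y), where Var uses the population convention (`ddof=0`, as numpy defaults to), while pandas' `.var()` defaults to `ddof=1`. A sketch on synthetic data verifying the identity:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn import metrics

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=50)

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

mse = metrics.mean_squared_error(y, y_hat)
r2 = metrics.r2_score(y, y_hat)

# r2_score divides by the population variance (ddof=0):
print(r2, 1 - mse / np.var(y))  # the two numbers agree
```

Also note that all these metrics are computed on the training data itself, so they describe fit, not predictive performance on unseen states.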